Published online 12 June 2012 



Nucleic Acids Research, 2012, Vol. 40, Web Server issue W609-W614 

doi:10.1093/nar/gks575 



CellBase, a comprehensive collection of RESTful 
web services for retrieving relevant biological 
information from heterogeneous sources 

Marta Bleda, Joaquin Tarraga, Alejandro de Maria, Francisco Salavert, 
Luz Garcia-Alonso, Matilde Celma, Ainoha Martin, Joaquin Dopazo and 
Ignacio Medina 

Department of Bioinformatics and Genomics, Centro de Investigacion Principe Felipe (CIPF), 46012 Valencia, 
Spain; CIBER de Enfermedades Raras (CIBERER), 46010 Valencia, Spain; Functional Genomics Node (INB) at 
CIPF, 46012 Valencia, Spain; and Research Center on Software Production Methods (ProS), DSIC, Universitat 
Politecnica de Valencia (UPV), 46007 Valencia, Spain 

Received March 23, 2012; Revised May 18, 2012; Accepted May 21, 2012 



ABSTRACT 

During the past years, the advances in high- 
throughput technologies have produced an unprece- 
dented growth in the number and size of repositories 
and databases storing relevant biological data. 
Today, there is more biological information than 
ever but, unfortunately, the current status of many 
of these repositories is far from being optimal. 
Some of the most common problems are that the 
information is spread out in many small databases; 
frequently there are different standards among 
repositories and some databases are no longer sup- 
ported or they contain too specific and unconnected 
information. In addition, data size is increasingly 
becoming an obstacle when accessing or storing 
biological data. All these issues make it very difficult 
to extract and integrate information from different 
sources, to analyze experiments or to access and 
query this information in a programmatic way. 
CellBase provides a solution to the growing neces- 
sity of integration by easing the access to biological 
data. CellBase implements a set of RESTful web 
services that query a centralized database contain- 
ing the most relevant biological data sources. The 
database is hosted in our servers and is regularly 
updated. CellBase documentation can be found at 
http://docs.bioinfo.cipf.es/projects/cellbase. 

INTRODUCTION 

During the past years, the increase in scientific knowledge 
and the massive data production have caused an 



exponential growth in the number and size of biological 
databases and repositories. However, data size, which can 
reach hundreds of gigabytes, involves serious problems of 
data access through Internet and data storage in local 
disks. 

Other challenging issues associated with biological data 
are that much relevant information is spread out in 
different databases or repositories, different identifiers or 
standards are used and data can be very frequently 
updated as new experiments are conducted. This is a par- 
ticular problem when analyzing high-throughput experi- 
ments such as expression profiling, genotyping or massive 
sequencing data because much heterogeneous biological 
information is required for its interpretation. 

To address these daily problems, we have developed a 
comprehensive infrastructure that comprises a relational 
database containing biological information and a web 
services application programming interface (API) to 
query all these data. This relational database integrates 
biological information from different sources and includes 
(i) core features such as genes, transcripts and exons or 
proteins; (ii) regulatory elements such as transcription 
factors (TFs) and TF binding sites, microRNA 
(miRNA) and curated and non-curated miRNA targets 
or CpG islands; (iii) many functional ontologies from 
Open Biomedical Ontologies (OBO) Foundry (1); (iv) 
variation data such as single-nucleotide polymorphisms 
(SNPs), phenotypic-related SNPs, known mutations or 
structural variation and (v) systems biology information 
such as pathways or protein interactome. 

The web services API has been implemented in the 
representational state transfer (REST) style, which allows an 
easy, lightweight, fast and intuitive way of querying data in the 
database. Outputs can be obtained in plain tabulated text 
or in JSON format. The database and RESTful web services 
have been designed and implemented to ensure high 
availability of the servers and to be fast, which results in 
real-time queries most of the time. 

Our results provide a convenient solution to access and 
retrieve heterogeneous relevant biological information 
without the need for local database installations. Data are 
always available through a high-availability cluster and queries 
have been tuned to ensure real-time performance. 



BIOLOGICAL SOURCES 

CellBase database and web services have been designed to 
integrate and provide easy and efficient access to the most 
relevant biological information. This access is provided 
through a comprehensive and extensible RESTful web 
services API. Currently, some of the model organisms 
supported are human, mouse, rat, zebrafish, worm, fruit 
fly, pig, dog and yeast (and soon new organisms will be 
added). 

CellBase integrates different data types from different 
sources into a relational database. These data comprise 
most relevant biological information taken from the 
main repositories. These data are organized in different 
sections depending on the type of information as described 
below. 

Core features 

We took genome sequences, genes, transcripts, exons, 
cytobands or cross references (xrefs) identifiers (IDs) 
from Ensembl (2). Protein information including se- 
quences, xrefs or protein features (natural variants, muta- 
genesis sites, post-translational modifications, etc.) were 
imported from UniProt (3). 

Regulatory 

CellBase imports miRNA from miRBase (4); curated and 
non-curated miRNA targets from miRecords (5), 
miRTarBase (6), TargetScan (7) and microRNA.org (8) 
and CpG islands and conserved regions from the UCSC 
database (9). 

Functional annotation 

OBO Foundry (1) develops many biomedical ontologies 
that are implemented in OBO format. We designed a SQL 
schema to store these OBO ontologies and >30 ontologies 
were imported. OBO ontology term annotations were 
taken from Ensembl (2). InterPro (10) annotations were 
also imported. 
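
As a rough illustration of storing OBO terms relationally, the sketch below parses a minimal OBO fragment and loads it into an in-memory SQLite database. The two-table layout (`term`, `term_is_a`), the function names and the sample stanzas are assumptions made for this example; the actual CellBase schema, which runs on MySQL, is not described here.

```python
import sqlite3

# Tiny OBO fragment for illustration (two Gene Ontology terms).
SAMPLE_OBO = """\
[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process
"""


def parse_obo_terms(text):
    """Return one dict with 'id', 'name' and 'is_a' per [Term] stanza."""
    terms, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = None
            if line == "[Term]":
                current = {"id": None, "name": None, "is_a": []}
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ", 1)[0].strip()  # drop trailing comment
            if tag in ("id", "name"):
                current[tag] = value
            elif tag == "is_a":
                current["is_a"].append(value)
    return terms


def load_terms(conn, terms):
    """Create the hypothetical tables and insert the parsed terms."""
    conn.execute("CREATE TABLE term (id TEXT PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE term_is_a (child TEXT, parent TEXT)")
    for term in terms:
        conn.execute("INSERT INTO term VALUES (?, ?)", (term["id"], term["name"]))
        conn.executemany(
            "INSERT INTO term_is_a VALUES (?, ?)",
            [(term["id"], parent) for parent in term["is_a"]],
        )


conn = sqlite3.connect(":memory:")
load_terms(conn, parse_obo_terms(SAMPLE_OBO))
parents = conn.execute(
    "SELECT parent FROM term_is_a WHERE child = ?", ("GO:0009987",)
).fetchall()
```

Keeping the `is_a` links in their own table lets a term have any number of parents, which is one common way to hold an ontology graph in a relational schema.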

Variation 

CellBase includes SNPs from dbSNP (11); SNP popula- 
tion frequencies from HapMap (12), 1000 genomes project 
(13) and Ensembl (2); phenotypically annotated SNPs 
were imported from NHGRI GWAS Catalog (14), 
HGMD (15), Open Access GWAS Database (16), 
UniProt (3) and OMIM (17); mutations from COSMIC 
(18) and structural variations from Ensembl (2). 



Systems biology 

We also import systems biology information like 
interactome information from IntAct (19). Reactome 
(20) stores pathway and interaction information in 
BioPAX (21) format. The BioPAX data exchange format 
enables the integration of diverse pathway resources. We 
successfully solved the problem of storing data released in 
BioPAX format in a SQL relational schema, which 
allowed us to import Reactome into CellBase. 

TECHNICAL DETAILS 

Architecture design and implementation 

CellBase database and web services architecture have been 
designed to be both very fast and fault-tolerant, thus 
providing a high-availability solution with no single 
point of failure. This is an important feature that makes 
CellBase very reliable and scalable if more servers are 
needed. Figure 1 shows a schema of the architecture. 
Some of the advantages of this architecture are that no 
installation of software tools or databases is needed by 
the user, as the whole API has been implemented using 
RESTful web services, and thus, database and web 
services API are always available. 

Data from different biological databases and sources 
were integrated into a normalized relational database, 
implemented in a MySQL replication cluster to support 
high-load access, as shown in Figure 1. In total, >200 
GB were stored in the database. To speed up queries, 
indexes and summary tables were created, resulting in 
runtimes of a few milliseconds for most of the queries. 
To provide high availability and load balancing 
for the MySQL replication cluster, Keepalived 
(http://www.keepalived.org) and HAProxy 
(http://haproxy.1wt.eu) services have been configured. 
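
The effect of indexes and summary tables can be sketched with a small in-memory SQLite database. The table name, column names and synthetic rows below are hypothetical and chosen only for illustration; CellBase itself uses MySQL and its real schema is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snp (name TEXT, chromosome TEXT, position INTEGER)")

# Synthetic rows: odd-numbered entries on chromosome 16, even ones on 13.
rows = [
    ("snp_%d" % i, "16" if i % 2 else "13", 3698000 + i * 50)
    for i in range(200)
]
conn.executemany("INSERT INTO snp VALUES (?, ?, ?)", rows)

# Composite index so region queries can seek instead of scanning the table.
conn.execute("CREATE INDEX idx_snp_region ON snp (chromosome, position)")

# Summary table: per-chromosome counts computed once, read many times.
conn.execute(
    "CREATE TABLE snp_summary AS "
    "SELECT chromosome, COUNT(*) AS n_snps FROM snp GROUP BY chromosome"
)


def snps_in_region(conn, chromosome, start, end):
    """Return SNP names inside chromosome:start-end, ordered by position."""
    cursor = conn.execute(
        "SELECT name FROM snp "
        "WHERE chromosome = ? AND position BETWEEN ? AND ? "
        "ORDER BY position",
        (chromosome, start, end),
    )
    return [row[0] for row in cursor]


# Ask SQLite how it would run a region query; the plan names the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM snp "
    "WHERE chromosome = '16' AND position BETWEEN 3698105 AND 3701105"
).fetchall()
```

The query plan shows the region lookup being served from `idx_snp_region`, while aggregate questions such as the number of SNPs per chromosome are answered from `snp_summary` without touching the main table.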

The web services API has been designed and implemented 
using the REST software architectural style. RESTful web 
services have some advantages over SOAP web services: 
they tend to be more lightweight, scalable 
and easy to build and consume, thus providing fast 
access to data even in low-bandwidth conditions. To 
achieve higher performance, Java was chosen for the 
server implementation both (i) to connect to MySQL 
using JBoss Hibernate (http://www.hibernate.org) and 
(ii) to implement the RESTful web services API using the 
Jersey library (http://jersey.java.net). Apache Tomcat was 
chosen as the Java application server to deploy the web 
archive (war file) with the RESTful web services API 
implementation. To provide high availability and load 
balancing for these web services, HAProxy 
(http://haproxy.1wt.eu) was set up to balance Apache Tomcat 
instances. The RESTful web services are registered in 
BioCatalogue (22) and access is free to all users. 
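
The idea of balancing requests across several application server instances can be illustrated with a conceptual round-robin sketch. This is not how HAProxy is implemented or configured; the class, its methods and the backend names are hypothetical and serve only to show the principle of rotating over backends and skipping those that are down.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand requests to backends in turn, skipping those marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


# Three hypothetical Tomcat instances behind the balancer.
balancer = RoundRobinBalancer(["tomcat-1", "tomcat-2", "tomcat-3"])
first_round = [balancer.next_backend() for _ in range(3)]

# Simulate one instance failing: traffic continues on the other two.
balancer.mark_down("tomcat-2")
after_failure = [balancer.next_backend() for _ in range(4)]
```

With one instance marked as down, requests keep alternating between the remaining two, which is the behaviour that removes a single point of failure at the web tier.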

Servers 

CellBase MySQL replication cluster database is running on 
two high-end servers with two Intel Xeon Hexa-Core 
CPUs each, 96 GB of memory and 6 SSD disks configured 
as a RAID-5 volume, giving 1.2 TB of extremely fast 
access to disk. CellBase Apache Tomcat instances are 
running in a cluster of three high-end servers with an 
Intel Xeon Quad-Core and 6 GB of memory each. 

Figure 1. Schema of CellBase architecture of RESTful web services. 
The diagram shows the data sources (core features, regulatory, 
functional annotation, variation and systems biology) feeding the 
MySQL cluster; the Java RESTful web services API design, with the 
URL structure and the genomic, feature, regulatory and network 
categories; and, on the client side, programmatic access (a CLI 
client has been implemented), web browser access and usage in web 
applications, with output in TXT or JSON. 



DATA ISSUES IN BIOLOGY 

Biology is experiencing an unprecedented growth of data 
and resources, which is making access to local or 
conventional databases very difficult. The new 'cloud 
computing' paradigm allows storing big data in remote 
servers and using web services to access and retrieve data 
efficiently. By doing so, researchers do not need to 
download, parse or integrate different sources since data are 
always up-to-date and can be retrieved by different client 
applications. Here, we follow a philosophy similar to some 
'cloud'-based applications or companies, in which heavy 
data are not moved through the web and are always 
up-to-date and available through the RESTful web services API. 



WEB SERVICES 

The CellBase RESTful web services API is an intuitive 
collection of methods that allow the user to search and 
retrieve biological data in a user-friendly way. RESTful 
calls are easily accessed using Uniform Resource 
Locators (URLs), reducing the access to biological 
information to a simple browser query. Output results can be 
retrieved in text or JSON formats. Today, all 
programming languages can handle URLs, which makes 
CellBase biological information fully accessible in 
a programmatic way. In this section, we describe the URL 
syntax and provide some usage examples and details on 
how to access data. 

Understanding the CellBase web services API structure 

Biological information can be accessed easily using URLs. 
These URLs have been designed to be flexible, neutral and 
scalable in future extensions, permitting the addition of 
new methods without altering the structure. The general 
syntax of CellBase URLs is as follows: 

ws.bioinfo.cipf.es/cellbase/rest/version/species/category/subcategory/id/resource?filters 

The first part of the URL, ws.bioinfo.cipf.es/cellbase/rest, 
refers to the host and is fixed for all methods. The 
remainder will vary depending on the user's query. 
'Versions' are numbered with the letter 'v' followed by a 
number (e.g. v1) or given by the keyword 'latest' to access 
the current release. The 'species' field can be specified using 
the three-letter code (hsa, mmu, rno, etc.) or the 
abbreviated format (hsapiens, mmusculus, rnorvegicus, 
etc.). Currently, biological information is available for 
11 species, as described above. The 'category' field aims to 
provide a general classification for the input identifier 
according to its nature. Four main categories are 
available: 

• 'Genomic', which makes reference to genomic coord- 
inates like regions, positions or variants. 

• 'Feature', involves all elements that have a defined 
location on the genome and provides a comfortable 
way to retrieve cross references for an identifier. 

• 'Regulatory', refers to all regulatory features, including 
interactions that involve TFs and microRNAs. 

• 'Network', makes reference to different types of 
networks and pathways, including the protein 
interactome, the regulatory network and Reactome. 

The 'subcategory' field must indicate the type of the 
input identifier (gene, transcript, region, pathway, etc.). 
Users can choose the most suitable one among the 
predefined subcategories. 'Id' is the query parameter, the feature 
or term about which the user wants to retrieve the 
information. Users can query more than one identifier 
using a comma-separated list. The 'resource' field refers 
to the information the user wants to obtain from the id 
field. Depending on the category and the subcategory 
specified, different 'resources' are available. Resources 
must always be written in singular. Table 1 summarizes 
a representative selection of the available resources for 
each category and the corresponding subcategories. 
Identifier types and formats for each subcategory are 
also described. Users can add 'filters' to the RESTful 
query after the question mark. Some of them can be 
applied to all queries, but some of them are specific for 
each subcategory. Examples of options that can be 
applied to all queries are the output format, coded as 'of', 
and the character used to separate columns in the resulting 
output, coded as 'separator'. 
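
The URL structure described above can be assembled programmatically. The following is a minimal sketch, not an official client: the function name `build_url` and its signature are our own, and the filter value `json` used in the usage lines is an assumption about how the 'of' option is spelled.

```python
from urllib.parse import quote, urlencode

HOST = "http://ws.bioinfo.cipf.es/cellbase/rest"
CATEGORIES = {"genomic", "feature", "regulatory", "network"}


def build_url(species, category, subcategory, ids, resource,
              version="latest", filters=None):
    """Compose host/version/species/category/subcategory/id/resource?filters."""
    if category not in CATEGORIES:
        raise ValueError("unknown category: %r" % category)
    if isinstance(ids, str):
        ids = [ids]
    # Several identifiers are sent as one comma-separated list.
    id_part = ",".join(quote(identifier, safe=":-") for identifier in ids)
    url = "/".join(
        [HOST, version, species, category, subcategory, id_part, resource]
    )
    if filters:
        url += "?" + urlencode(filters)
    return url


# Usage: a region query, and a two-gene query with an output-format filter.
region_url = build_url("hsa", "genomic", "region", "16:3698105-3701105", "snp")
genes_url = build_url(
    "hsa", "feature", "gene", ["BRCA2", "BCL2"], "transcript",
    filters={"of": "json"},  # assumed spelling of the JSON output option
)
```

Percent-encoding each identifier keeps names that contain spaces, such as pathway names, valid inside the path while leaving the `chr:start-end` punctuation untouched.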

Additional web services have been designed to provide 
metadata about CellBase and the web services themselves, like 
retrieving information about the available species, genome 
assemblies or species codes. In addition, column headers 
and usage about subcategories have been added to provide 
online help to users. 

Detailed and up-to-date information on currently 
available methods can be accessed by visiting the 
documentation page at 
http://docs.bioinfo.cipf.es/projects/cellbase/wiki. 

Examples of CellBase RESTful web services queries 

To provide a clarifying idea about what these web services 
are able to do, here we show some examples of usage. 

Example 1. To obtain the SNPs of a particular region 
(i.e. chromosome 16 from 3698105 to 3701105), we use 
the genomic category since the specified inputs are 
genomic coordinates: 

http://ws.bioinfo.cipf.es/cellbase/rest/latest/hsa/genomic/region/16:3698105-3701105/snp 
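
Before sending a region query, a client may want to check that the identifier follows the chr:start-end format. The sketch below is one possible client-side check; the names `Region`, `parse_region` and `format_region` are hypothetical, and the rules enforced here (numeric coordinates, start not greater than end) are our assumptions rather than validation rules documented for CellBase.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


# chromosome name, a colon, then two integers joined by a hyphen
_REGION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def parse_region(text):
    """Parse 'chr:start-end' into a Region, raising ValueError if malformed."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError("expected chr:start-end, got %r" % text)
    chromosome = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("region start is greater than end: %r" % text)
    return Region(chromosome, start, end)


def format_region(region):
    """Render a Region back into the chr:start-end identifier format."""
    return "%s:%d-%d" % (region.chromosome, region.start, region.end)


region = parse_region("16:3698105-3701105")
identifier = format_region(region)
```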

Example 2. Retrieving all transcripts for a gene is 
straightforward using the feature category: 

http://ws.bioinfo.cipf.es/cellbase/rest/latest/hsa/feature/gene/BRCA2/transcript 



Table 1. Summary of some of the main available categories and subcategories 

Category    Subcategory   Identifier format         Resources 
Genomic     Region        chr:start-end             gene, transcript, snp, sequence, reverse, tfbs, mirna_target, regulatory 
            Variant       chr:position:new allele   consequence_type 
            Position      chr:position              gene, snp, mutation, functional 
Feature     Gene          All gene ID formats       info, sequence, transcript, tfbs, mirna_target, protein_feature, snp, mutation 
            Transcript    Ensembl or RefSeq ID      info, gene, sequence, exon 
            Snp           dbSNP or Ensembl ID       info, consequence_type, population_frequency, phenotype, xref 
            Exon          Ensembl ID                info, sequence, region, transcript 
            Protein       UniProt or Ensembl ID     info, gene, sequence, transcript, feature, xref, variant 
            Id            All possible IDs          xref 
Regulatory  mirna_gene    miRBase gene ID           info, gene, mirna_mature, target, disease 
            mirna_mature  miRBase mature ID         info, gene, mirna_gene, target, disease 
            Tf            TF or gene name           info, tfbs, gene, protein, pwm 
Network     Pathway       none                      list 
            Pathway       Reactome pathway name     info, subpathway, element, gene, protein, image 
            Interactome   UniProt or Ensembl ID     info, element, neighbourhood, adjacent, connected_component 
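
The content of Table 1 can be held in a small lookup structure so that a client can check a category, subcategory and resource combination before building a query. This is only a sketch: the dictionary transcribes the table in lowercase, as the names appear in the example URLs, and because the table lists a representative selection, a missing entry does not prove that a resource is unavailable. The names `TABLE1_RESOURCES` and `listed_in_table1` are our own.

```python
# (category, subcategory) -> resources listed in Table 1
TABLE1_RESOURCES = {
    ("genomic", "region"): {"gene", "transcript", "snp", "sequence", "reverse",
                            "tfbs", "mirna_target", "regulatory"},
    ("genomic", "variant"): {"consequence_type"},
    ("genomic", "position"): {"gene", "snp", "mutation", "functional"},
    ("feature", "gene"): {"info", "sequence", "transcript", "tfbs",
                          "mirna_target", "protein_feature", "snp", "mutation"},
    ("feature", "transcript"): {"info", "gene", "sequence", "exon"},
    ("feature", "snp"): {"info", "consequence_type", "population_frequency",
                         "phenotype", "xref"},
    ("feature", "exon"): {"info", "sequence", "region", "transcript"},
    ("feature", "protein"): {"info", "gene", "sequence", "transcript",
                             "feature", "xref", "variant"},
    ("feature", "id"): {"xref"},
    ("regulatory", "mirna_gene"): {"info", "gene", "mirna_mature", "target",
                                   "disease"},
    ("regulatory", "mirna_mature"): {"info", "gene", "mirna_gene", "target",
                                     "disease"},
    ("regulatory", "tf"): {"info", "tfbs", "gene", "protein", "pwm"},
    ("network", "pathway"): {"list", "info", "subpathway", "element", "gene",
                             "protein", "image"},
    ("network", "interactome"): {"info", "element", "neighbourhood",
                                 "adjacent", "connected_component"},
}


def listed_in_table1(category, subcategory, resource):
    """Return True if the combination appears in the transcribed table."""
    key = (category.lower(), subcategory.lower())
    return resource.lower() in TABLE1_RESOURCES.get(key, set())


def subcategories(category):
    """Return the sorted subcategories listed for a category."""
    return sorted(sub for cat, sub in TABLE1_RESOURCES if cat == category.lower())
```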
Example 3. To convert a list of Ensembl gene identifiers 
to their HGNC gene symbol, CellBase implements a 
way to retrieve cross references for an identifier using 
the External references (xref) subcategory. We just 
need to specify the list of identifiers and the name of 
the database we want to convert these IDs to. The 
resulting query will look like this: 

http://ws.bioinfo.cipf.es/cellbase/rest/latest/hsa/feature/id/ENSG00000011478,ENSG00000008382/xref?dbname=hgnc_symbol 

Example 4. To fetch the target genes for a particular 
miRNA like hsa-mir-150: 

http://ws.bioinfo.cipf.es/cellbase/rest/latest/hsa/regulatory/mirna_gene/hsa-mir-150/target 

Example 5. To download a diagram of a specific 
pathway: 

http://ws.bioinfo.cipf.es/cellbase/rest/latest/hsa/network/pathway/Triacylglycerol%20biosynthesis/image 

Example 6. To obtain all available species, codes and 
genome assemblies: 

http://ws.bioinfo.cipf.es/cellbase/rest/latest/species 
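
Queries like the ones above can be issued from a script with a plain HTTP GET. The sketch below uses only the Python standard library; the function names are ours, the sample response body is hypothetical (the column layout of real responses is not described in this section), and the host named in the examples may not be reachable from every environment.

```python
import json
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Issue an HTTP GET and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_tabulated(text, separator="\t"):
    """Split tabulated output into rows of fields, skipping blank lines."""
    return [line.split(separator) for line in text.splitlines() if line.strip()]


def parse_body(text, output_format="text", separator="\t"):
    """Parse a response body as JSON or as tabulated text."""
    if output_format == "json":
        return json.loads(text)
    return parse_tabulated(text, separator)


# Hypothetical response body, used only to exercise the parser offline.
sample = "id_1\t16\t3698200\nid_2\t16\t3699000\n"
rows = parse_body(sample)

# A live call would look like this (not executed here):
# body = fetch("http://ws.bioinfo.cipf.es/cellbase/rest/latest/species")
```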



Using CellBase RESTful web services 

CellBase RESTful web services have been implemented 
using HTTP GET method, so they can be accessed 
directly from any web browser. We have also developed 
a Perl client Command Line Interface that can query 
CellBase web services and can parse large files of 
identifiers. CellBase web services are also available through 
some applications developed in our department such as 
Genome Maps (http://genomemaps.org), RENATO 
(http://renato.bioinfo.cipf.es) or VARIANT 
(http://variant.bioinfo.cipf.es), which demonstrates the 
benefits and potential of this implementation. 
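
When a file holds many identifiers, one plausible approach is to send them in batches, using the comma-separated list syntax of the id field. The sketch below is not the Perl client mentioned above, whose behaviour is not described here; the function names are hypothetical and the default batch size of 100 is an arbitrary assumption.

```python
from urllib.parse import quote

BASE = "http://ws.bioinfo.cipf.es/cellbase/rest/latest/hsa/feature/id"


def read_identifiers(lines):
    """Yield stripped, non-empty identifiers, dropping repeats."""
    seen = set()
    for line in lines:
        identifier = line.strip()
        if identifier and identifier not in seen:
            seen.add(identifier)
            yield identifier


def batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def xref_urls(lines, dbname, batch_size=100):
    """Yield one cross-reference query URL per batch of identifiers."""
    for batch in batches(read_identifiers(lines), batch_size):
        ids = ",".join(quote(identifier, safe="") for identifier in batch)
        yield "%s/%s/xref?dbname=%s" % (BASE, ids, quote(dbname, safe=""))


# Lines as they might be read from an identifier file.
lines = ["ENSG00000011478\n", "\n", "ENSG00000008382\n", "ENSG00000011478\n"]
urls = list(xref_urls(lines, "hgnc_symbol", batch_size=2))
```

Because `xref_urls` is a generator over an iterable of lines, an open file object can be passed directly and the file is never loaded into memory at once.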



OTHER TOOLS 

Although other solutions may appear similar, they 
do not cover the wide variety of biological information 
included in CellBase. While some of them are quite 
restricted to specific biological content, others are more 
general but do not provide web services and often 
require a local installation. Several resources use the 
Distributed Annotation System (DAS) protocol (23) to 
export their data encoded in XML format, which is 
slower, larger and more difficult to parse than simple 
text or JSON. Similar to CellBase RESTful web services, 
DAS can distribute biological information based on 
genome and protein annotations. However, the DAS 
protocol cannot handle functional annotations or 
systems biology data, such as gene ontology terms or 
protein-protein interactions, as CellBase does. Some 
databases like Ensembl, HapMap or InterPro export their 
biological information using tools like BioMart, which also 
provide data through web services. This can be useful 
when querying information from a single database, but 
becomes a problem when users need to link data from 
several BioMart sources. CellBase facilitates more 
complex data queries by integrating and linking together 
different biological sources. 

DISCUSSION 

In this work, we have designed and implemented CellBase, 
a database and a collection of RESTful web services that 
enable quick and easy access to heterogeneous biological 
information. By joining data from some of the most 
relevant resources and integrating them into a single 
standardized database, CellBase provides users with a 
homogeneous RESTful web service API, with no need 
to query or download different sources. The integration 
of RESTful web services technologies has represented a 
great advantage to provide a comfortable way to query 
and retrieve this biological information using URLs. 

CellBase can be freely accessed in several ways to 
accommodate different scenarios or users. Web services 
can be consumed programmatically from any computing 
language or from a web browser; moreover, a Perl client 
has also been developed. Biological information and 
metadata services have been implemented. Database and 
web servers' infrastructure have been designed to provide 
a high-availability and high-performance solution. 

The database is maintained by a group of biologists and 
computer scientists. Regular updates will be carried out 
every few months as new data appear. More species and 
additional biological data will be also added since the 
database schema and RESTful web services have been 
designed to be scalable and cover all biological informa- 
tion requirements. CellBase has proven to be priceless in 
some of our projects such as Genome Maps, RENATO or 
VARIANT, and we are, currently, developing new 
applications using CellBase as a key part of their 
implementation. 

As more genomes are sequenced and more data become 
available, access and transfer will become a critical 
problem. To solve these issues, we are working on 
moving the database and web services to the cloud to be 
able to increase the physical resources. In addition, by 
doing this, different images could be installed in different 
geographic regions for faster data queries. The problem of 
data size is also affecting relational databases, as they are 
approaching the limit of data they can store on a 
single machine. To solve this problem, we are also exploring 
different solutions like NoSQL databases, which allow 
several terabytes of data to be stored in a distributed way. These 
changes will be transparent to users, as the web services will 
remain unchanged. 